Journal of Chemical Information and Modeling — Latest Matching Preprints

1

G-screen: Scalable Receptor-Aware Virtual Screening through Flexible Ligand Alignment

Jung, N.; Park, H.; Yang, J.; Seok, C.

2026-03-05 biophysics 10.64898/2026.03.03.707320 medRxiv

Top 0.1%

79.4%

Virtual screening has long been a central computational tool for rational ligand discovery, enabling the systematic prioritization of candidate molecules from large chemical libraries. Although docking and related approaches that explicitly account for receptor-ligand interactions have been developed and refined over several decades, achieving both reliable receptor-aware interaction modeling and computational scalability remains an open challenge, particularly for ultra-large chemical spaces. Ligand-based methods are fast and robust but do not explicitly incorporate receptor structure, whereas docking-based approaches model receptor-ligand interactions more directly at substantially higher computational cost. Here, we present G-screen, a freely available and scalable receptor-aware virtual screening framework designed for cases in which a reference protein-ligand complex structure is available. Instead of performing full docking, G-screen rapidly aligns candidate ligands to the reference ligand using a flexible global alignment algorithm (G-align) and evaluates receptor-aware pharmacophore interactions derived from the reference complex, thereby combining the efficiency of ligand-based alignment with explicit atomic-level interaction analysis. Benchmarking on DUD-E, LIT-PCBA, and MUV datasets demonstrates that G-screen achieves competitive discrimination and early enrichment relative to representative ligand-based and docking-based methods, while maintaining millisecond-scale per-molecule runtimes under multi-threaded execution. These results position G-screen as a practical and scalable receptor-aware screening strategy for efficiently filtering large chemical libraries when a reference complex structure is available.

Scientific Contribution: We have developed a scalable virtual screening framework for efficiently filtering ultra-large chemical libraries using a flexible global alignment algorithm combined with receptor-aware pharmacophore evaluations. Despite explicitly capturing atomic-level interactions, the screening process using this method is highly efficient, maintaining millisecond-scale per-molecule runtimes under parallel execution. It achieves competitive discrimination and early enrichment, successfully bridging the speed of ligand-based approaches with the structural context of traditional docking.

2

DiffDock-Glide: a hybrid physics-based and data-driven approach to molecular docking

Herron, L.; Dakka, J.; Yao, K.; Shi, D.; Zhang, Y.; Jerome, S. V.

2025-06-04 molecular biology 10.1101/2025.06.02.657461 medRxiv

Top 0.1%

78.4%

Recent years have seen a rise in applications of deep learning to problems in the molecular sciences. Among them, the diffusion model DiffDock stands out as a method for docking small molecules into protein binding sites. But DiffDock struggles to compete with conventional docking methods, especially for targets outside its training set. We develop a hybrid model called DiffDock-Glide which addresses some shortcomings of deep learning docking methods: it uses a modified generative process to generate samples within a binding pocket, and the confidence model is replaced with Glide's post-docking minimization pipeline. We evaluate DiffDock-Glide on the PoseBusters dataset and show improved sampling of near-native poses, especially for sequences without homologues in the training set. We also evaluate DiffDock-Glide's performance in virtual screening of compounds from the DUD-E dataset against receptor structures generated by AlphaFold2 and report enrichment values that broadly surpass those from traditional Glide.

3

Undersampling techniques for non-linear chemical space visualization

Surendran, A.; Zsigmond, K.; Quintana, R. A. M.

2025-07-07 bioinformatics 10.1101/2025.07.03.663077 medRxiv

Top 0.1%

77.9%

The visualization of high-dimensional chemical space is a critical tool for understanding molecular diversity and structure-property relationships, and for guiding compound selection. However, the performance of non-linear dimensionality reduction (DR) techniques such as t-distributed Stochastic Neighbor Embedding (t-SNE), Uniform Manifold Approximation and Projection (UMAP), and Generative Topographic Mapping (GTM) is often sensitive to the choice of hyperparameters, and training these methods on large datasets is costly. In this study, we investigated the effect of undersampling methods on hyperparameter selection for these non-linear dimensionality reduction methods. Our results demonstrate that selecting small representative subsets of chemical data not only reduces computational costs associated with hyperparameter training but also serves as an innovative means to train non-linear DR methods, leading to projections that better preserve the local structure within the chemical space.

4

MFPLI: A Computational Framework for Assessing Biological Authenticity of Protein-Ligand Interactions Using Molecular Fingerprints and Structural Features

Zhang, H.; Zheng, J.; Li, H.; Wan, S.; Guan, G.; Li, B.; Liu, L.; He, W.

2025-06-19 molecular biology 10.1101/2025.06.19.659858 medRxiv

Top 0.1%

77.1%

Traditional computational drug discovery approaches struggle to accurately evaluate the biological authenticity of protein-ligand binding conformations due to inherent limitations in empirical scoring functions and force field approximations. This study proposes MFPLI, a deep learning framework integrating multimodal physicochemical features to systematically assess the biological authenticity alignment between molecular docking poses and true co-crystal structures. By establishing a continuous surface characterization system for protein-ligand interfaces, we concurrently incorporate geometric curvature features (radius, shape index) and chemical interaction fields (electrostatic potential, hydrogen-bond networks, hydrophobicity gradients). A contrastive learning architecture based on Siamese equivariant graph neural networks was developed to enable discriminative analysis between co-crystal conformations and parameter-perturbed pseudo-conformations generated through inverse docking. The five-channel fusion model demonstrates robust performance on the time-split PoseBusters validation set (AUC=0.91), with predicted Euclidean distance deviation (ΔE) effectively distinguishing native co-crystal conformations from aberrant docking poses in 80% of samples. Notably, 71% of ΔE-negative samples concentrate within the [-0.3, 0] interval, reflecting physical consistency between model predictions and conformational transition processes. This framework establishes a novel paradigm for biological authenticity assessment in virtual screening for computer-aided drug discovery through synergistic modeling of surface topology and interaction chemistry.

5

qcMol: a large-scale dataset of 1.2 million molecules with high-quality quantum chemical annotations for molecular representation learning

Wang, H.; Zhang, Z.; Gong, H.

2025-09-12 biophysics 10.1101/2025.09.07.674462 medRxiv

Top 0.1%

72.3%

Recent advancements in deep learning have greatly promoted the de novo design of drugs and materials. Previous studies have shown that a well-designed molecular representation is critical for improving the accuracy of deep-learning-based molecular property prediction methods. However, the lack of large-scale data enriched with detailed physicochemical information hinders effective learning of an informative molecular representation. To fill this data gap, we introduce qcMol, a dataset consisting of 1.2 million molecules from 95 datasets with high-quality quantum chemical annotations, to facilitate molecular representation learning as well as downstream molecular property prediction. Chemicals in this dataset include drug-like compounds, metabolites and molecules with matched experimental data, covering 247,448 kinds of scaffolds and a broad spectrum of molecular sizes. Each compound in qcMol is annotated with detailed quantum chemical information, obtained through reliable quantum chemical calculations based on B3LYP-D3/def2-SV(P)//GFN2-xTB as well as the follow-up wave function post-analysis. These features are organized into multiple formats, allowing for flexible integration into diversified molecular representation learning frameworks. The broad data distribution, comprehensive quantum chemical annotations and flexible data formats jointly enable qcMol to serve as a pre-training resource as well as a benchmark test set for deep learning models, benefiting practical in silico drug discovery. qcMol is freely accessible from https://structpred.life.tsinghua.edu.cn/qcmol/.

6

Dynamic consensus pocket detection across molecular dynamics ensembles reveals persistent and transient druggable sites

Marigliani, G.; Petrizzelli, F.; Mangoni, M.; Bianco, S. D.; Orzella, I.; Guzzi, P. H.; Caputo, V.; Biagini, T.; Mazza, T.

2026-07-02 bioinformatics 10.64898/2026.06.27.734992 medRxiv

Top 0.1%

72.3%

The traditional 'one drug, one target' paradigm assumes that drugs interact with a single specific binding site. Modern pharmacology has proven this definition overly simplistic and, instead, recognizes that drugs operate within complex biological systems and often interact with multiple targets. In this context, proteins cannot be viewed as possessing a single functional binding site, but rather as dynamic entities capable of accommodating ligands at multiple regions, including transient and cryptic pockets. Here, we review and repurpose representative pocket detection tools across geometry-based, energy-based, and machine/deep learning approaches, originally designed to work on static conformations, to evaluate their agreement on molecular dynamics-derived conformational ensembles. Using GLUT1 protein as a dynamic transporter model and Aldose reductase as a cryptic-pocket reference system, we combine inter-tool concordance, HDBSCAN-based spatial clustering, volumetric IoU analysis, and temporal persistence scoring. Our results show that different algorithmic classes capture complementary aspects of pocket dynamics, with energy-based methods showing stronger sensitivity to transient cryptic regions and geometry-based approaches depending more strongly on pre-formed cavities. This work proposes a consensus-oriented framework for identifying conserved and transient druggable pockets in dynamic protein systems.
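
Two of the quantities named above, volumetric IoU and temporal persistence, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the voxel-set representation of a pocket, the grid spacing, the IoU cutoff, and the helper names `voxelize`, `volumetric_iou`, and `persistence` are all assumptions made for illustration.

```python
import math


def voxelize(points, spacing=1.0):
    """Map 3D points (e.g. pocket probe centers) to a set of integer grid cells.

    The grid spacing is an illustrative assumption, not a value from the abstract.
    """
    return {tuple(math.floor(c / spacing) for c in p) for p in points}


def volumetric_iou(cells_a, cells_b):
    """Intersection over union of two pockets given as sets of grid cells."""
    union = cells_a | cells_b
    if not union:
        return 0.0
    return len(cells_a & cells_b) / len(union)


def persistence(cells_per_frame, reference_cells, iou_cut=0.5):
    """Fraction of frames whose pocket overlaps the reference with IoU >= iou_cut."""
    if not cells_per_frame:
        return 0.0
    hits = sum(
        1 for cells in cells_per_frame
        if volumetric_iou(cells, reference_cells) >= iou_cut
    )
    return hits / len(cells_per_frame)
```

Under these assumptions, a pocket with high persistence would count as conserved across the ensemble, while a pocket that reaches the cutoff in only a few frames would be a candidate transient site.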

7

Enhanced Thompson Sampling by Roulette Wheel Selection for Screening Ultra-Large Combinatorial Libraries

Zhao, H.; Nittinger, E.; Tyrchan, C.

2024-05-21 bioinformatics 10.1101/2024.05.16.594622 medRxiv

Top 0.1%

71.3%

Chemical space exploration has gained significant interest with the increase in available building blocks, which enables the creation of ultra-large virtual libraries containing billions or even trillions of compounds. However, this raises the challenge of selecting the most suitable compounds for synthesis, and one such challenge is hit expansion. Recently, Thompson sampling, a probabilistic search approach, has been proposed by Walters et al. to achieve efficiency gains by operating in the reagent space rather than the product space. Here, we aim to address some of its shortcomings and propose optimizations. We introduce a warmup routine to ensure that initial probabilities are set for all reagents with a minimum number of molecules evaluated. Additionally, a roulette wheel selection is proposed with adapted stop criteria to improve sampling efficiency, and belief distributions of reagents are only updated when they appear in new molecules. We demonstrate that a 100% recovery rate can be achieved by sampling 0.1% of the fully enumerated library, showcasing the effectiveness of our proposed optimizations.
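
The three ingredients described above (a warmup pass, roulette wheel selection over reagents, and belief updates only for new molecules) can be sketched in Python for a toy two-component library. This is an illustrative sketch, not the authors' implementation: the Gaussian belief per reagent, the Boltzmann weighting with a `temperature` parameter, the assumption that higher scores are better, and every name (`ReagentBelief`, `roulette_pick`, `search`) are assumptions. The adapted stop criteria mentioned in the abstract are not modeled; the loop simply runs for a fixed number of iterations.

```python
import math
import random


def roulette_pick(weights, rng):
    """Pick an index with probability proportional to its non-negative weight."""
    threshold = rng.random() * sum(weights)
    running = 0.0
    for i, w in enumerate(weights):
        running += w
        if threshold < running:
            return i
    return len(weights) - 1  # guard against floating-point round-off


class ReagentBelief:
    """Running mean and variance of the scores observed for one reagent."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def update(self, score):
        # Welford's online update of mean and sum of squared deviations.
        self.n += 1
        delta = score - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (score - self.mean)

    def sample(self, rng):
        # Draw from a Gaussian belief over the reagent's mean score.
        var = self.m2 / (self.n - 1) if self.n > 1 else 1.0
        return rng.gauss(self.mean, math.sqrt(var / max(self.n, 1)))


def search(reagents_a, reagents_b, score_fn, n_iter, warmup=3,
           temperature=1.0, seed=0):
    rng = random.Random(seed)
    beliefs = [[ReagentBelief() for _ in reagents_a],
               [ReagentBelief() for _ in reagents_b]]
    seen = {}

    def evaluate(i, j):
        if (i, j) in seen:  # beliefs are updated only for new molecules
            return
        score = score_fn(reagents_a[i], reagents_b[j])
        seen[(i, j)] = score
        beliefs[0][i].update(score)
        beliefs[1][j].update(score)

    # Warmup: pair every reagent with up to `warmup` random partners,
    # so each reagent has at least one observation before sampling starts.
    for i in range(len(reagents_a)):
        for _ in range(warmup):
            evaluate(i, rng.randrange(len(reagents_b)))
    for j in range(len(reagents_b)):
        for _ in range(warmup):
            evaluate(rng.randrange(len(reagents_a)), j)

    for _ in range(n_iter):
        picks = []
        for component in beliefs:
            draws = [b.sample(rng) for b in component]
            top = max(draws)
            # Boltzmann weights: better sampled scores get larger wheel slices.
            weights = [math.exp((d - top) / temperature) for d in draws]
            picks.append(roulette_pick(weights, rng))
        evaluate(picks[0], picks[1])
    return seen, beliefs
```

Compared with always taking the reagent with the best sampled score, a roulette wheel keeps some probability mass on lower-ranked reagents, which is one plausible way to read the claim of improved sampling efficiency.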

8

TwinSAR: An Adaptive Kernel-based Algorithm with logit-transformed Z-score Filtering for Chemical Twin Detection in Large-scale Virtual Screening

Haris Kulosmanovic, H.; Uguz, C.; Durdagi, S.

2026-05-15 bioinformatics 10.64898/2026.05.12.724687 medRxiv

Top 0.1%

70.0%

Molecular similarity searching is a workhorse of cheminformatics, but the dominant Tanimoto/topological-fingerprint paradigm has well-known blind spots. It is highly sensitive to molecular size, suffers from steep activity cliffs, and frequently fails to retrieve scaffold-hopping bioisosteres. A complementary descriptor that has received comparatively little attention is global elemental composition. Despite the conceptual simplicity of comparing molecules by their elemental ratios, no widely deployed method exists for the statistically rigorous identification of "chemical twins" defined by stoichiometric proximity. We address this gap with TwinSAR (Stoichiometric Analysis and Retrieval), an adaptive kernel-based algorithm that combines three methodological innovations: (i) binary fingerprint blocking that partitions molecules by element-presence patterns and reduces the cost of all-pairs comparison from O(NM) to O(Σ_i n_i m_i), enabling million/billion-scale searches; (ii) a per-block adaptive radial basis function (RBF) kernel whose precision parameter is calibrated independently for each fingerprint block via the median heuristic, providing fair similarity comparison across chemical sub-spaces of vastly different density; and (iii) a logit-transformed Z-score filter that maps bounded RBF scores onto an unbounded scale, allowing high-similarity pairs to be prioritized relative to the empirical score distribution of their own fingerprint block. TwinSAR is offered in two operating modes: (i) a deterministic BULK mode for exact reproducibility; and (ii) a stochastic FAST mode that achieved a 3.29x wall-clock speed-up in the present benchmark while preserving similar unique-query and unique-target coverage. Statistical validation showed that detected twin pairs are 12.7x more similar in absolute ratio space than block-matched random pairs (p < 0.001), while a column-permutation negative control returned a median of zero spurious twins across three independent permutations. A controlled benchmark further established that an 8-element representation (single-element heavy-atom ratios) is sensitivity-equivalent to a comprehensive 254-element representation while running 3.55x faster. As a case study, TwinSAR was deployed in an end-to-end virtual screening pipeline against the BCL-2 target protein, where it reduced a 327,071-compound commercial library to a focused panel of 390 candidates. The chemical interpretability of the retrieved twins is illustrated by their structural diversity around conserved heavy-atom skeletons. TwinSAR therefore provides a fast, conformation-free, and statistically principled prefilter that is fully orthogonal to topological fingerprints.
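
The three steps listed in the abstract (blocking by element presence, a per-block RBF kernel calibrated by the median heuristic, and a logit-transformed Z-score filter) can be sketched in Python as follows. The abstract gives no formulas, so everything concrete here is an assumption: the Euclidean distance on elemental-ratio vectors, the specific median-heuristic form gamma = 1 / (2 * median^2), the clipping constant in `logit`, the `z_cut` threshold, and the function names. It is not the TwinSAR code.

```python
import math
import statistics
from collections import defaultdict


def presence_key(ratios):
    """Binary element-presence pattern used as the blocking key."""
    return tuple(r > 0 for r in ratios)


def logit(p, eps=1e-9):
    """Map a bounded score in (0, 1) onto an unbounded scale (clipped at eps)."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def find_twins(queries, targets, z_cut=2.0):
    """Return (query_index, target_index) pairs that stand out within their block."""
    q_blocks, t_blocks = defaultdict(list), defaultdict(list)
    for qi, q in enumerate(queries):
        q_blocks[presence_key(q)].append(qi)
    for ti, t in enumerate(targets):
        t_blocks[presence_key(t)].append(ti)

    twins = []
    for key, q_ids in q_blocks.items():
        # Only molecules sharing an element-presence pattern are compared.
        pairs = [(qi, ti, math.dist(queries[qi], targets[ti]))
                 for qi in q_ids for ti in t_blocks.get(key, [])]
        if len(pairs) < 2:
            continue
        median = statistics.median(d for _, _, d in pairs)
        if median == 0:
            continue
        gamma = 1.0 / (2.0 * median * median)  # assumed median-heuristic form
        logits = [logit(math.exp(-gamma * d * d)) for _, _, d in pairs]
        mean = statistics.fmean(logits)
        spread = statistics.pstdev(logits)
        if spread == 0:
            continue
        for (qi, ti, _), value in zip(pairs, logits):
            if (value - mean) / spread >= z_cut:  # Z-score within the block
                twins.append((qi, ti))
    return twins
```

Because the Z-score is computed against each block's own score distribution, a pair is kept for being unusually similar within its block, not for exceeding a single global similarity cutoff. With very small blocks the achievable Z-score is limited, so the threshold in this sketch would need to be chosen with block size in mind.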

9

SLOGEN: A Structure-based Lead Optimization Model Unifying Fragment Generation and Screening

Yang, B.; Xu, Y.; Xiang, C.; Zhu, Y.; Li, T.; Sinitskiy, A.; Li, J.

2025-10-15 bioinformatics 10.1101/2025.10.14.682343 medRxiv

Top 0.1%

69.2%

Lead optimization plays an important role in preclinical drug discovery. While deep learning has accelerated this process, structure-based approaches that leverage 3D protein-ligand information remain underexplored. Existing models can improve predicted affinity but often yield synthetically inaccessible compounds, whereas screening-based methods limit chemical novelty by relying on fixed fragment libraries. To bridge the gap, we introduce Slogen--a Structure-based Lead Optimization algorithm unifying fragment Generation and screENing. To achieve this, Slogen integrates a transformer-based variational autoencoder, pretrained on the BindingNet v2 dataset, with an E(3)-equivariant graph neural network that models 3D protein-fragment interactions. This unified framework enables both fragment generation and similarity-based screening, simultaneously addressing synthetic tractability and structural diversity. A benchmarking study shows that Slogen matches or surpasses state-of-the-art methods while exploring broader chemical space. Case studies on the Smoothened and D1 dopamine receptors demonstrate its capacity to design high-affinity, drug-like molecules, providing a practical method for structure-guided lead optimization.

10

SHARP: Generating Synthesizable Molecules via Fragment-based Hierarchical Action-space Reinforcement Learning for Pareto Optimization

Kim, J.; Ryu, S.; Park, H.; Seok, C.

2025-07-23 bioinformatics 10.1101/2025.07.18.665529 medRxiv

Top 0.1%

69.1%

Designing drug-like molecules that satisfy multiple objectives--such as high binding affinity, synthesizability, and drug-likeness--poses a complex global optimization problem over an astronomically large chemical space. Existing deep learning-based molecular generative models often treat this task as distribution modeling, relying on atom-level autoregressive actions with less consideration of explicit optimization feedback. Consequently, they frequently generate invalid structures, converge to local optima, or produce synthetically infeasible candidates. Here, we introduce SHARP (Synthesizable Hierarchical Action-space Reinforcement learning for Pareto optimization), a molecular generator that addresses these limitations via a fragment-based hierarchical action space and reinforcement learning. SHARP ensures synthetic accessibility by applying action masks guided by a pretrained Synthesizability Estimation Model (SEM). The reinforcement learning (RL) policy is trained using a composite reward function integrating docking scores, pharmacophore matching, and solvent accessibility to generate functionally relevant and experimentally tractable molecules. Furthermore, across four lead optimization tasks--fragment growing, linker design, scaffold hopping, and sidechain decoration--on a diverse receptor set, SHARP consistently outperforms prior methods in producing molecules with high affinity and synthesizability. These results demonstrate that reinforcement learning with a chemically intuitive action space design can be an effective solution to the optimization challenges in AI-driven drug discovery, offering a robust framework for rational molecular design in structure-based applications.

11

A Diffusion-Based Framework for Designing Molecules in Flexible Protein Pockets

Wang, J.; Dokholyan, N. V.

2025-05-30 bioinformatics 10.1101/2025.05.27.656443 medRxiv

Top 0.1%

69.0%

The design of molecules for flexible protein pockets represents a significant challenge in structure-based drug discovery, as proteins often undergo conformational changes upon ligand binding. While deep learning-based approaches have shown promise in molecular generation, they typically treat protein pockets as rigid structures, limiting their ability to capture the dynamic nature of protein-ligand interactions. Here, we introduce YuelDesign, a novel diffusion-based framework specifically developed to address this challenge. YuelDesign employs a new protein encoding scheme with a fully connected graph representation to encode protein pocket flexibility, a systematic denoising process that refines both atomic properties and coordinates, and a specialized bond reconstruction module tailored for de novo generated molecules. Our results demonstrate that YuelDesign generates molecules with favorable drug-likeness and low synthetic complexity. The generated molecules also exhibit diverse chemical functional groups, including some not even present in the training set. Redocking analysis reveals that the generated molecules exhibit docking energies comparable to native ligands. Additionally, a detailed analysis of the denoising process shows how the model systematically refines molecular structures through atom type transitions, bond dynamics, and conformational adjustments. Overall, YuelDesign presents a versatile framework for generating novel molecules tailored to flexible protein pockets, with promising implications for drug discovery applications.

12

An AI-Driven Framework for Discovery of BACE1 Inhibitors for Alzheimer's Disease

Xie, E.; Hasegawa, K.; Kementzidis, G.; Papadopoulos, E.; Aktas, B. H.; Deng, Y.

2024-05-15 biochemistry 10.1101/2024.05.15.594361 medRxiv

Top 0.1%

66.9%

Alzheimer's Disease (AD) is a progressive neurodegenerative disorder that affects over 51 million individuals globally. The β-secretase (BACE1) enzyme is responsible for the production of amyloid beta (Aβ) plaques in the brain. The accumulation of Aβ plaques leads to neuronal death and the impairment of cognitive abilities, both of which are fundamental symptoms of AD. Thus, BACE1 has emerged as a promising therapeutic target for AD. Previous BACE1 inhibitors have faced various issues related to molecular size and blood-brain barrier permeability, preventing any of them from maturing into FDA-approved AD drugs. In this work, a generative AI framework is developed as the first AI application to the de novo generation of BACE1 inhibitors. Through a simple, robust, and accurate molecular representation, a Wasserstein Generative Adversarial Network with Gradient Penalty (WGAN-GP), and a Genetic Algorithm (GA), the framework generates and optimizes over 1,000,000 candidate inhibitors that improve upon the bioactive and pharmacological properties of current BACE1 inhibitors. Then, a molecular docking simulation models the candidate inhibitors and identifies 14 candidate drugs that exhibit stronger binding interactions to the BACE1 active site than previous candidate BACE1 drugs from clinical trials. Overall, the framework successfully discovers BACE1 inhibitors and candidate AD drugs, accelerating the developmental process for a novel AD treatment.

13

A universal model for drug-receptor interactions

Menezes, F.; Wahida, A.; Popowicz, G. M.

2025-08-02 bioinformatics 10.1101/2025.08.01.668090 medRxiv

Top 0.1%

66.8%

Modern AI models promise to decode the genomic landscape that holds, in principle, the information required for rational therapeutic design. Genes encode proteins whose functions are mediated by their three-dimensional structures via bonded and non-bonded interactions. Since the late 1970s, the advent of macromolecular crystallography inspired the notion that structural knowledge alone could enable a "lock-and-key" approach to drug design. However, this framework has failed to catalyze a step-change in generating new drugs. Drug discovery continues to depend on costly, resource-intensive, and largely serendipitous screening campaigns that probe only an infinitesimal fraction of the drug-like chemical space. Despite some success cases, our understanding of, and reasoning from, non-bonded interaction chemistry is still too limited for generalized applicability. Furthermore, though structural databases contain hundreds of thousands of entries, a strong historical bias pervades protein-drug structures, hindering reliable advances through data science. Here, we show how a simple machine learning model successfully infers the principles of non-bonded chemical interactions in the drug-receptor space. A reductionist approach to the training data led to a model generalizing drug-target interactions, minimizing the memorization frequently seen in large structural models. We show how the model can infer complex interactions incomprehensible for classical physics-based force field models and approach a quantum level of understanding. The model was validated on retrospective and prospective real-life problems in drug optimization when targeting a challenging protein-protein interface. Our approach offers a simple, interpretable and explainable way to steer drug optimization and condition complex generative models to greatly accelerate, diversify and enhance drug discovery.

14

Convex-PLR - Revisiting affinity predictions and virtual screening using physics-informed machine learning

Kadukova, M.; Chupin, V.; Grudinin, S.

2021-09-15 bioinformatics 10.1101/2021.09.13.460049 medRxiv

Top 0.1%

66.6%

Virtual screening is an essential part of the modern drug design pipeline, which significantly accelerates the discovery of new drug candidates. Structure-based virtual screening involves ligand conformational sampling, which is often followed by re-scoring of docking poses. A great variety of scoring functions have been designed for this purpose. The advent of structural and affinity databases and the progress in machine-learning methods have recently boosted scoring function performance. Nonetheless, the most successful scoring functions are typically designed for specific tasks or systems. All-purpose scoring functions still perform poorly on virtual screening tests, compared to the precision with which they are able to predict co-crystal binding poses. Another limitation is the low interpretability of the heuristics being used. We analyzed scoring function performance in the CASF benchmarks and discovered that the vast majority of them have a strong bias towards predicting larger binding interfaces. This motivated us to develop a physical model with additional entropic terms with the aim of penalizing such a preference. We parameterized the new model using affinity and structural data, solving a classification problem followed by regression. The new model, called Convex-PLR, demonstrated high-quality results on multiple tests and a substantial improvement over its predecessor Convex-PL. Convex-PLR can be used for molecular docking together with VinaCPL, our version of AutoDock Vina, with Convex-PL integrated as a scoring function. Convex-PLR, Convex-PL, and VinaCPL are available at https://team.inria.fr/nano-d/convex-pl/.

15

Structure-guided compound prioritization strategy for virtual screening identifies putative binders for the nuclear receptor LRH-1

Chang-Gonzalez, A. C.; Campbell, A. N.; Bell, E. W.; Blind, R.; Meiler, J.

2026-06-07 bioinformatics 10.64898/2026.06.04.730240 medRxiv

Top 0.1%

66.1%

Compound ranking in structure-based virtual screening notoriously yields highly ranked false positive binders due to variable poses or biases in scoring terms. We developed a compound prioritization strategy that utilizes sampled docked poses from contrasting docking approaches (targeted physics-based docking and blind docking with a generative model) against multiple models of the target protein to train a multi-layer perceptron (MLP). The model predicts binders at the orthosteric ligand-binding pocket of the nuclear receptor LRH-1 (NR5A2). Our approach circumvents the reliance on a single docked pose for scoring compounds or individual scoring metrics for compound ranking. In a separate benchmarking set, we observed that the MLP identifies known binders that are chemically dissimilar from the compounds in the training set and is sensitive to single scaffold modifications, making it a potential tool for lead optimization. We applied our strategy to a prospective virtual screening campaign, which resulted in the discovery of four putative LRH-1 binders. We found that a combination of scoring and prediction metrics enriches for the hit compounds across library sizes. In all, this implementation presents a method to leverage structural and experimental data to aid virtual screening for a challenging protein target.

16

Integrating computational chemistry and machine learning to predict KRAS mutation-induced resistance

Mizgalska, K.; Urbaniak, K.; Imbody, D. J.; Haura, E. B.; Guida, W. C.; Branciamore, S.; Karolak, A.

2026-04-11 biophysics 10.64898/2026.04.10.717640 medRxiv

Top 0.1%

65.8%

Mutation-induced drug resistance is a major contributor to the failure of targeted cancer therapies, particularly in tumors driven by mutations in the KRAS oncogene. Although covalent inhibitors effectively target KRAS G12C, secondary mutations such as G12C/Y96C, G12C/Y96S, and G12C/Y96D lead to resistance despite leaving the covalent attachment site intact. To predict these resistance outcomes, we developed a computational framework that integrates molecular dynamics-derived structural, energetic, thermodynamic, and contact-based descriptors with machine learning. Features extracted from simulations of treatment-sensitive and treatment-resistant KRAS mutants were used to train logistic regression, random forest, support vector machine, and Bayesian Network classifiers, achieving average accuracies above 90%. Solvent-accessible surface area variability, Lennard-Jones 1,4 energy, mean square displacement, and root mean square fluctuation emerged as the most discriminatory features. Residues G10, E62, and H95 showed the highest predictive value. This approach highlights conformational and solvent-exposure changes as central drivers of KRAS drug resistance and provides a generalizable workflow for other clinically relevant mutant targets.

Author Summary: Mutation-induced resistance is a common challenge across many cancer types and is often associated with aggressive tumor progression and poor therapeutic response. Investigating the dynamic properties of proteins harboring such mutations provides valuable insights into the structural and functional consequences of these alterations, thereby helping to elucidate the mechanisms of drug resistance. Machine learning algorithms are particularly effective at uncovering complex patterns within high-dimensional data, such as molecular dynamics simulation trajectories. Integrating these algorithms with analysis of protein dynamics holds significant potential to aid in drug discovery challenges by reducing both time and resource demands while increasing the likelihood of identifying effective therapeutic candidates. As a proof of concept, we developed a computational framework that integrates molecular dynamics-derived molecular features with machine learning to distinguish treatment-sensitive from treatment-resistant KRAS mutants. KRAS is known for drug resistance arising from secondary mutations that disrupt inhibitor binding despite intact covalent attachment sites. The models achieved over 90% accuracy and identified solvent-exposure and conformational changes at residues G10, E62, and H95 as key predictors of treatment resistance. This workflow offers a generalizable strategy for understanding and forecasting mutation-induced resistance.

17

Small molecules targeting the structural dynamics of AR-V7 partially disordered protein using deep learning and physics based models

Karatzas, P.; Brotzakis, Z. F.; Sarimveis, H.

2024-02-28 bioinformatics 10.1101/2024.02.23.581804 medRxiv

Top 0.1%

65.5%

Partially disordered proteins can contain both stable and unstable secondary structure segments and are involved in various (mis)functions in the cell. The extensive conformational dynamics of partially disordered proteins scaling with extent of disorder and length of the protein hampers the efficiency of traditional experimental and in-silico structure-based drug discovery approaches. Therefore new efficient paradigms in drug discovery taking into account conformational ensembles of proteins need to emerge. In this study, using as a test case the AR-V7 transcription factor splicing variant related to prostate cancer, we present an automated methodology that can accelerate the screening of small molecule binders targeting partially disordered proteins. By swiftly identifying the conformational ensemble of AR-V7, and reducing the dimension of binding-sites by a factor of 90 by applying appropriate physicochemical filters, we combine physics based molecular docking and multi-objective classification machine learning models that speed up the screening of thousands of compounds targeting AR-V7 multiple binding sites. Our method not only identifies previously known binding sites of AR-V7, but also discovers new ones, as well as increases the multi-binding site hit-rate of small molecules by a factor of 10 compared to naive physics-based molecular docking.

18

PointVS: A Machine Learning Scoring Function that Identifies Important Binding Interactions

Scantlebury, J.; Vost, L.; Carbery, A.; Hadfield, T. E.; Turnbull, O. M.; Brown, N.; Chenthamarakshan, V.; Das, P.; Grosjean, H.; von Delft, F.; Deane, C. M.

2022-10-31 bioinformatics 10.1101/2022.10.28.511712 medRxiv

Top 0.1%

65.4%

Over the last few years, many machine learning-based scoring functions for predicting the binding of small molecules to proteins have been developed. Their objective is to approximate the distribution which takes two molecules as input and outputs the energy of their interaction. Only a scoring function that accounts for the interatomic interactions involved in binding can accurately predict binding affinity on unseen molecules. However, many scoring functions make predictions based on dataset biases rather than an understanding of the physics of binding. These scoring functions perform well when tested on similar targets to those in the training set, but fail to generalise to dissimilar targets. To test what a machine learning-based scoring function has learnt, input attribution--a technique for learning which features are important to a model when making a prediction on a particular data point--can be applied. If a model successfully learns something beyond dataset biases, attribution should give insight into the important binding interactions that are taking place. We built a machine learning-based scoring function that aimed to avoid the influence of bias via thorough train and test dataset filtering, and show that it achieves comparable performance on the CASF-2016 benchmark to other leading methods. We then use the CASF-2016 test set to perform attribution, and find that the bonds identified as important by PointVS, unlike those extracted from other scoring functions, have a high correlation with those found by a distance-based interaction profiler. We then show that attribution can be used to extract important binding pharmacophores from a given protein target when supplied with a number of bound structures. We use this information to perform fragment elaboration, and see improvements in docking scores compared to using structural information from a traditional, data-based approach. 
This not only provides definitive proof that the scoring function has learnt to identify some important binding interactions, but also constitutes the first deep learning-based method for extracting structural information from a target for molecule design.

19

Predicting Permeation of Compounds across the Outer Membrane of P. aeruginosa Using Molecular Descriptors: Advantages and Limitations

Manrique, P. D.; Leus, I.; Lopez, C. A.; Mehla, J.; Malloci, G.; Gervasoni, S.; Vargiu, A.; Kinthada, R.; Herndon, L.; Hengartner, N.; Walker, J. K.; Rybenkov, V.; Ruggerone, P.; Zgurskaya, H.; Gnanakaran, S.

2023-09-05 biophysics 10.1101/2023.09.02.555818 medRxiv

Top 0.1%

65.3%

The ability of Gram-negative pathogens to adapt and protect themselves against antibiotics is a growing threat to public health. The low permeability of the outer membrane (OM) and effective multidrug efflux pumps constitute the two main antibiotic resistance mechanisms. Though much effort has been devoted to discovering new antibiotics that can bypass these defense mechanisms, no new antibiotic classes have been introduced into clinics in the last 35 years. Models that identify specific descriptors of molecular properties and predict the likelihood that a given compound can successfully permeate the OM and inhibit bacterial growth while avoiding efflux could facilitate the discovery of novel classes of antibiotics. Here we evaluate 174 molecular descriptors of 1260 antimicrobial compounds and study their correlations with antibacterial activity in Gram-negative Pseudomonas aeruginosa. While some of these descriptors are computed using traditional approaches based on the physicochemical properties intrinsic to the compounds, ensemble docking and all-atom molecular dynamics (MD) simulations are used to derive additional bacterium-specific mechanistic properties. Descriptors of compound permeation across the OM were calculated using all-atom MD simulations of the compounds in different subregions of the OM model. Descriptors of interactions with efflux pumps were calculated from ensemble docking of compounds targeting specific binding pockets of MexB, the major efflux transporter of P. aeruginosa. Using these descriptors and the measured antibacterial inhibitory concentrations of compounds, we design and implement a statistical protocol to identify a subset of the molecular properties that are predictive of whether a given compound is a strong or weak permeator across the Gram-negative OM. 
Our results indicate that 88.4% of the compounds that show measurable antibacterial activity follow very consistent rules of permeation, which highlights the critical role that the interaction between the compound and the OM has in predicting permeation. The remaining 11.6% of the compounds, although less predictive, are characterized by distinctive structural markers that can be used to minimize classification errors. An implementation of the permeation rules and the structural markers uncovered in our study is shown, and it demonstrates the accuracy of our approach in a set of previously unseen compounds. Taken together, our analysis sheds new light on the key molecular properties that drug candidates should have in order to be effective at OM permeation/inhibition of P. aeruginosa, and opens the gate to similar data-driven studies in other Gram-negative pathogens.

20

Best practices to cluster large molecular libraries

Lope Perez, K.; Miranda Quintana, R. A.

2025-12-01 bioinformatics 10.1101/2025.11.28.691214 medRxiv

Top 0.1%

65.1%

BitBIRCH is a novel clustering algorithm that enables the analysis of extremely large molecular libraries; however, its performance can be hindered by an excessive number of singletons or the formation of disproportionately large clusters. Here, we present a data-driven strategy to identify optimal BitBIRCH parameters that mitigate these limitations. Using the ChEMBL34 library as a case study (with additional datasets reported in the Supporting Information), we show that similarity thresholds between three and four standard deviations above the global mean provide a balanced trade-off between cluster count and medoid similarity. These values are efficiently approximated with the iSIM and iSIM-sigma frameworks. For the branching factor, values as high as computationally feasible are recommended, as increasing it to 1024 substantially reduced the number of singletons. We further introduce an iterative re-clustering procedure wherein the similarity threshold can be adjusted to merge related subclusters and singletons from the initial clustering, providing user-defined control over the extent of cluster fusion. This work provides practical guidelines to enhance the robustness and usability of BitBIRCH for large-scale molecular clustering.
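
The threshold rule described in this abstract (global mean similarity plus three to four standard deviations) can be illustrated with a minimal sketch. It assumes binary fingerprints and brute-force pairwise Tanimoto similarity, which is only practical for small sets; it is not the iSIM or iSIM-sigma approximation the authors use, and it does not call BitBIRCH. The function names and the default of 3.5 standard deviations are hypothetical choices made for illustration.

```python
import numpy as np

def tanimoto_matrix(fps):
    """Pairwise Tanimoto similarity for a 2D array of 0/1 fingerprints."""
    fps = np.asarray(fps, dtype=np.int64)
    common = fps @ fps.T                       # bits set in both fingerprints
    counts = fps.sum(axis=1)                   # bits set in each fingerprint
    union = counts[:, None] + counts[None, :] - common
    # Assumption: a pair of all-zero fingerprints gets similarity 0.
    return np.divide(common, union,
                     out=np.zeros(common.shape, dtype=float),
                     where=union > 0)

def similarity_threshold(fps, n_sigma=3.5):
    """Mean pairwise similarity plus n_sigma standard deviations, capped at 1."""
    sims = tanimoto_matrix(fps)
    # Use each unordered pair once and skip self-similarity on the diagonal.
    pairs = sims[np.triu_indices(len(sims), k=1)]
    threshold = pairs.mean() + n_sigma * pairs.std()
    return min(float(threshold), 1.0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical toy library: 200 sparse random 256-bit fingerprints.
    toy_fps = (rng.random((200, 256)) < 0.1).astype(int)
    for n_sigma in (3.0, 3.5, 4.0):
        print(n_sigma, similarity_threshold(toy_fps, n_sigma))
```

The exact pairwise computation costs quadratic time and memory in the number of molecules, which is why an approximation is needed for libraries at the scale the abstract discusses.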